IEICE globals.ieice.org Site

Keyword Search Result

[Keyword] parallel process (118 hits)

Hits 21-40 of 118

Optimized Implementation of Pedestrian Tracking Using Multiple Cues on GPU
Ryusuke MIYAMOTO Hiroki SUGANO

PAPER-Image Processing

Vol: E94-A No:11
Page(s): 2323-2333
Pedestrian recognition for automotive and security applications, which require accurate recognition in images taken from distant observation points, is a challenging problem in computer vision. To achieve accurate recognition, both detection and tracking must be precise. For detection, several excellent schemes suitable for pedestrian recognition from distant observation points have been proposed; however, no tracking scheme achieves sufficient performance. To construct an accurate tracking scheme suitable for pedestrian recognition from distant observation points, we propose a novel pedestrian tracking scheme using multiple cues: HSV histograms and HOG features. Experimental results show that the proposed scheme can properly track a target pedestrian where tracking schemes using only a single cue fail. Moreover, we implement the proposed scheme on an NVIDIA® Tesla™ C1060 processor, one of the latest GPUs, to achieve real-time processing. Experimental results show that the computation time required to track one frame is reduced to 8.80 ms with our implementation, whereas an Intel® Core™ i7 CPU 975 @ 3.33 GHz takes 111 ms.
High-Speed FPGA Implementation of the SHA-1 Hash Function
Je-Hoon LEE Sang-Choon KIM Young-Jun SONG

LETTER-Cryptography and Information Security

Vol: E94-A No:9
Page(s): 1873-1876
This paper presents a high-speed SHA-1 implementation. Unlike the conventional unfolding transformation, the proposed unfolding transformation technique gives the combined hash operation blocks almost the same delay overhead regardless of the unfolding factor. It achieves high SHA-1 throughput by avoiding the performance degradation caused by the first hash computation. We demonstrate the proposed SHA-1 architecture on an FPGA chip. In the experimental results, the SHA-1 architecture with an unfolding factor of 5 achieves 1.17 Gbps. The proposed SHA-1 architecture achieves about a 31% performance improvement over its counterparts. Thus, the proposed SHA-1 architecture is applicable to the security of high-speed but compact mobile appliances.
A Fast Divide-and-Conquer Algorithm for Indexing Human Genome Sequences
Woong-Kee LOH Yang-Sae MOON Wookey LEE

PAPER-Fundamentals of Information Systems

Vol: E94-D No:7
Page(s): 1369-1377
Since the release of the human genome sequences, indexing those sequences has been one of the most important research issues, and the suffix tree is the most widely adopted structure for that purpose. Traditional suffix tree construction algorithms suffer from severe performance degradation due to the memory bottleneck problem. Recent disk-based algorithms also provide limited performance improvement due to random disk accesses. Moreover, they do not fully utilize recent CPUs with multiple cores. In this paper, we propose a fast algorithm based on a 'divide-and-conquer' strategy for indexing the human genome sequences. Our algorithm nearly eliminates random disk accesses by accessing the disk in units of contiguous chunks. In addition, our algorithm fully utilizes multi-core CPUs by dividing the genome sequences into multiple partitions and then assigning each partition to a different core for parallel processing. Experimental results show that our algorithm outperforms DIGEST, the previous fastest algorithm, by up to 10.5 times.
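
The partition-then-process idea in this abstract can be illustrated with a small sketch. It is not the authors' algorithm and not DIGEST: it builds a suffix array instead of a suffix tree, partitions suffixes by their first character, and uses a thread pool as a stand-in for per-core workers. The names `sort_partition` and `build_suffix_array` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def sort_partition(text, starts):
    # Sort the suffixes of one partition; they all share the same first character.
    return sorted(starts, key=lambda i: text[i:])


def build_suffix_array(text, workers=4):
    # Divide: group suffix start positions by their first character.
    partitions = {}
    for i, ch in enumerate(text):
        partitions.setdefault(ch, []).append(i)
    keys = sorted(partitions)
    # Conquer: sort each partition as an independent task.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_parts = list(
            pool.map(lambda k: sort_partition(text, partitions[k]), keys)
        )
    # Combine: partitions are already ordered by first character, so concatenate.
    return [i for part in sorted_parts for i in part]


if __name__ == "__main__":
    print(build_suffix_array("GATTACA"))
```

Because the partitions never share a suffix, no coordination is needed between tasks. In CPython a process pool, not a thread pool, would be needed to occupy several cores.
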
NUFFT- & GPU-Based Fast Imaging of Vegetation
Amedeo CAPOZZOLI Claudio CURCIO Antonio DI VICO Angelo LISENO

PAPER-Sensing

Vol: E94-B No:7
Page(s): 2092-2103
We develop an effective algorithm, based on the filtered backprojection (FBP) approach, for the imaging of vegetation. Under the FBP scheme, the reconstruction amounts to a non-trivial Fourier inversion, since the data are Fourier samples arranged on a non-Cartesian grid. The computational issue is efficiently tackled by Non-Uniform Fast Fourier Transforms (NUFFTs), whose complexity grows asymptotically as that of a standard FFT. Furthermore, significant speed-ups over fast CPU implementations are obtained with a parallel version of the NUFFT algorithm, purposely designed to run on Graphics Processing Units (GPUs) using the CUDA language. The performance of the parallel algorithm has been assessed against a CPU-multicore accelerated Matlab implementation of the same routine, against other CPU-multicore accelerated implementations based on a standard FFT and employing linear, cubic, spline and sinc interpolations, and against a different parallel algorithm exploiting a parallel linear interpolation stage. The proposed approach proved to be the most computationally convenient. Furthermore, an indoor polarimetric experimental setup is developed that can isolate and introduce, one at a time, different non-idealities of a real acquisition, such as the sources (wind, rain) of temporal decorrelation. Experimental far-field polarimetric measurements on a Thuja plicata (western redcedar) tree show the performance of the developed algorithm, its robustness against data truncation and temporal decorrelation, and the possibility of discriminating scatterers with different features within the investigated scene.
Energy-Saving Stochastic Scheduling of a Real-Time Parallel Task with Varying Computation Amount on Multi-Core Processors
Wan Yeon LEE Kyong Hoon KIM

LETTER-Systems and Control

Vol: E94-A No:2
Page(s): 842-845
The proposed scheduling scheme minimizes the mean energy consumption of a real-time parallel task whose computation amount is probabilistic and which can be executed concurrently on multiple cores. The scheme determines an appropriate number of cores to allocate to the task and the instantaneous frequency supplied to the allocated cores. Evaluation shows that the scheme saves a significant amount of energy compared with the previous method, which minimizes the mean energy consumption on a single core.
High-Speed Computation of the Kleene Star in Max-Plus Algebraic System Using a Cell Broadband Engine
Hiroyuki GOTO

PAPER-Fundamentals of Information Systems

Vol: E93-D No:7
Page(s): 1798-1806
This research addresses a high-speed computation method for the Kleene star of the weighted adjacency matrix in a max-plus algebraic system. We focus on systems whose precedence constraints are represented by a directed acyclic graph and implement the method on a Cell Broadband Engine™ (CBE) processor. Since the resulting matrix gives the longest travel times between two adjacent nodes, it is often utilized in scheduling problem solvers for a class of discrete event systems. This research, in particular, attempts to achieve a speedup by using two approaches: parallelization and SIMDization (Single Instruction, Multiple Data), both of which can be accomplished by a CBE processor. The former refers to parallel computation using multiple cores, while the latter is a method whereby multiple elements are computed by a single instruction. Using the implementation on a Sony PlayStation 3™ equipped with a CBE processor, we found that SIMDization is effective regardless of the system's size and the number of processor cores used. We also found that the scalability of using multiple cores is remarkable, especially for systems with a large number of nodes. In a numerical experiment where the number of nodes is 2000, we achieved a speedup of 20 times compared with the method without the above techniques.
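
For readers unfamiliar with the operation being accelerated, here is a minimal sequential sketch of the Kleene star in max-plus algebra, where addition is replaced by max and multiplication by +. It uses the textbook series A* = e ⊕ A ⊕ A² ⊕ ... ⊕ A^(n-1), which terminates for a directed acyclic graph. It is not the paper's parallel or SIMD method. The function names and the convention that `a[i][j]` holds the weight of the arc from node j to node i are assumptions for illustration.

```python
NEG_INF = float("-inf")  # the max-plus "zero": no arc


def maxplus_matmul(a, b):
    # Max-plus matrix product: entry (i, j) is max over k of a[i][k] + b[k][j].
    n = len(a)
    return [
        [max(a[i][k] + b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def kleene_star(a):
    # A* = e (+) A (+) A^2 (+) ... (+) A^(n-1), with (+) meaning element-wise max.
    n = len(a)
    unit = [[0.0 if i == j else NEG_INF for j in range(n)] for i in range(n)]
    result = unit
    power = unit
    for _ in range(n - 1):
        power = maxplus_matmul(power, a)
        result = [
            [max(result[i][j], power[i][j]) for j in range(n)] for i in range(n)
        ]
    return result


if __name__ == "__main__":
    # Assumed example: arcs 0->1 (weight 2), 1->2 (weight 3), 0->2 (weight 4).
    a = [
        [NEG_INF, NEG_INF, NEG_INF],
        [2.0, NEG_INF, NEG_INF],
        [4.0, 3.0, NEG_INF],
    ]
    print(kleene_star(a)[2][0])
```

In the example, the entry for the pair (2, 0) picks the longer route through node 1 (2 + 3) over the direct arc (4). Each row of the product is independent of the others, which is what makes the operation a natural target for multi-core and SIMD execution.
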
An Optimization System with Parallel Processing for Reducing Common-Mode Current on Electronic Control Unit
Yuji OKAZAKI Takanori UNO Hideki ASAI

PAPER

Vol: E93-C No:6
Page(s): 827-834
In this paper, we propose an optimization system with parallel processing for reducing electromagnetic interference (EMI) on an electronic control unit (ECU). We adopt simulated annealing (SA), a genetic algorithm (GA) and tabu search (TS) to seek optimal solutions, and a Spice-like circuit simulator to analyze the common-mode current. The proposed system can therefore determine adequate combinations of the parasitic inductance and capacitance values on a printed circuit board (PCB) efficiently and practically, reducing the EMI caused by the common-mode current. Finally, we apply the proposed system to an example circuit to verify its validity and efficiency.
Column-Parallel Vision Chip Architecture for High-Resolution Line-of-Sight Detection Including Saccade
Junichi AKITA Hiroaki TAKAGI Keisuke DOUMAE Akio KITAGAWA Masashi TODA Takeshi NAGASAKI Toshio KAWASHIMA

PAPER-Image Sensor/Vision Chip

Vol: E90-C No:10
Page(s): 1869-1875
Although the line of sight (LoS) is expected to be useful as an input method for computer systems, the application of conventional LoS detection systems, composed of a video camera and an image processor, is restricted to specialized areas such as academic research because of their large size and high cost. Our eye motion includes a rapid movement, the so-called 'saccade', which is expected to be useful for various applications. Because the saccade is very fast, it is impossible to track it without a high-speed camera. The authors have previously proposed a high-speed vision chip for LoS detection including saccade, based on a pixel-parallel processing architecture; however, its resolution is very low because of its large pixel size. In this paper, we propose and discuss a vision chip architecture for LoS detection including saccade that is based on column-parallel processing, which increases the resolution while keeping a high processing speed.
Media Processing LSI Architectures for Automotives -- Challenges and Future Trends --
Ichiro KURODA Shorin KYO

INVITED PAPER

Vol: E90-C No:10
Page(s): 1850-1857
This paper presents media processor architectures for automotive applications. Media processing applications and their requirements for LSI implementations are first described for vision-based driver assistance as well as for graphical user interfaces for car navigation using 3D graphics. Then, parallel processing architectures for vision and graphics in these applications are reviewed with their performance and cost. After that, future trends in automotive media processing, such as the integration of vision and 3D graphics functions, are shown with their applications and the required performance. Moreover, parallel processing architectures are discussed for the integration of vision and graphics. Finally, a prospect of a next-generation media processing LSI for automotives is provided.
Hamiltonian Cycles and Hamiltonian Paths in Faulty Burnt Pancake Graphs
Keiichi KANEKO

PAPER-Algorithm Theory

Vol: E90-D No:4
Page(s): 716-721
Recently, research on parallel processing systems is very active, and many complex topologies have been proposed. A burnt pancake graph is one such topology. In this paper, we prove that a faulty burnt pancake graph with degree n has a fault-free Hamiltonian cycle if the number of the faulty elements is n-2 or less, and it has a fault-free Hamiltonian path between any pair of nonfaulty nodes if the number of the faulty elements is n-3 or less.
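
As background for the topology named in this abstract, the sketch below enumerates a burnt pancake graph under its usual definition: nodes are signed permutations of 1..n, and two nodes are adjacent when one is obtained from the other by reversing a prefix and negating every element in it. The function names are hypothetical, and the code only builds the graph. It does not implement the Hamiltonian cycle construction of the paper.

```python
from itertools import permutations, product


def signed_prefix_reversal(node, k):
    # Reverse the first k entries and flip their signs ("flip the burnt pancakes").
    flipped = tuple(-x for x in reversed(node[:k]))
    return flipped + node[k:]


def neighbors(node):
    # One neighbour per prefix length, so every node has degree n.
    return [signed_prefix_reversal(node, k) for k in range(1, len(node) + 1)]


def all_nodes(n):
    # Every signed permutation of 1..n: n! * 2**n nodes in total.
    for perm in permutations(range(1, n + 1)):
        for signs in product((1, -1), repeat=n):
            yield tuple(s * p for s, p in zip(signs, perm))


if __name__ == "__main__":
    print(neighbors((1, 2, 3)))
    print(sum(1 for _ in all_nodes(3)))
```

Applying the same prefix reversal twice returns the original node, so the adjacency relation is symmetric and the graph is undirected.
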
Real-Time Huffman Encoder with Pipelined CAM-Based Data Path and Code-Word-Table Optimizer
Takeshi KUMAKI Yasuto KURODA Masakatsu ISHIZAKI Tetsushi KOIDE Hans Jurgen MATTAUSCH Hideyuki NODA Katsumi DOSAKA Kazutami ARIMOTO Kazunori SAITO

PAPER-Image Processing and Video Processing

Vol: E90-D No:1
Page(s): 334-345
This paper presents a novel optimized real-time Huffman encoder using a pipelined data path based on CAM technology and a parallel code-word-table optimizer. The exploitation of CAM technology enables fast parallel search of the code word table. At the same time, the code word table is optimized according to the frequency of received input symbols and is updated in real time. Since these two functions work in parallel, the proposed architecture realizes fast parallel encoding and keeps a constantly high compression ratio. Evaluation results for the JPEG application show that the proposed architecture can achieve up to 28% smaller encoded picture sizes than the conventional architectures. The encoding time is reduced by 95% in comparison to a conventional SRAM-based architecture, which makes the encoder suitable even for the latest end-user devices requiring fast frame rates. Furthermore, the proposed architecture provides the only encoder that can simultaneously realize small compressed data size and fast processing speed.
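
The relationship between symbol frequencies and the code word table can be sketched in software. The following is a plain textbook Huffman construction, not the CAM-based hardware or the table optimizer of the paper. `huffman_table` and `encode` are hypothetical names, and rebuilding the table from scratch is only a stand-in for the real-time update described above.

```python
import heapq
from collections import Counter


def huffman_table(symbols):
    # Build a symbol -> bit-string table from observed symbol frequencies.
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap items: (count, tie_breaker, partial table for that subtree).
    heap = [
        (count, idx, {sym: ""})
        for idx, (sym, count) in enumerate(sorted(freq.items()))
    ]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)
        c2, _, t2 = heapq.heappop(heap)
        # Merge the two rarest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + code for s, code in t1.items()}
        merged.update({s: "1" + code for s, code in t2.items()})
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(symbols, table):
    # Table lookup per symbol; this is the step a CAM would search in parallel.
    return "".join(table[s] for s in symbols)


if __name__ == "__main__":
    data = "aaaabbc"
    table = huffman_table(data)
    print(table, encode(data, table))
```

Frequent symbols receive shorter code words, so a table that tracks the actual input statistics keeps the encoded output small.
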
Scalable FPGA/ASIC Implementation Architecture for Parallel Table-Lookup-Coding Using Multi-Ported Content Addressable Memory
Takeshi KUMAKI Yutaka KONO Masakatsu ISHIZAKI Tetsushi KOIDE Hans Jurgen MATTAUSCH

PAPER-Image Processing and Video Processing

Vol: E90-D No:1
Page(s): 346-354
This paper presents a scalable FPGA/ASIC implementation architecture for high-speed parallel table-lookup-coding using multi-ported content addressable memory, aiming at facilitating effective table-lookup-coding solutions. The multi-ported CAM adopts Flexible Multi-ported Content Addressable Memory (FMCAM) technology, which represents an effective parallel processing architecture and was previously reported in [1]. To achieve a high-speed parallel table-lookup-coding solution, FMCAM is improved by additional schemes for a single search mode and a counting value setting mode, so that it permits fast parallel table-lookup-coding operations. Evaluation results for Huffman encoding within the JPEG application show that a synthesized semi-custom ASIC implementation of the proposed architecture can already reduce the required number of clock cycles by 93% in comparison to a conventional DSP. Furthermore, the performance per unit area, measured in MOPS/mm², can be improved by a factor of 3.8 in comparison to DSPs operated in parallel. Consequently, the proposed architecture is very suitable for FPGA/ASIC implementation and is a promising solution for small-area integrated realization of real-time table-lookup-coding applications.
Design and Evaluation of a Massively Parallel Processor Based on Matrix Architecture
Toru SHIMIZU Masami NAKAJIMA Masahiro KAINAGA

INVITED PAPER

Vol: E89-C No:11
Page(s): 1512-1518
This paper describes the design and evaluation of a massively parallel processor based on the Matrix architecture, which is suitable for portable multimedia applications. The proposed architecture achieves 40 GOPS of 16-bit fixed-point additions at a 200 MHz clock frequency and 250 mW power dissipation. In addition, 1 M-bit SRAM for data registers and 2,048 2-bit processing elements connected by a flexible switching network are integrated in 3.1 mm² in 90 nm low-power CMOS technology. The energy-efficient Matrix architecture supports 2,048-way parallel operations and the programmable functions required for multimedia SoCs.
A Power- and Area-Efficient SRAM Core Architecture with Segmentation-Free and Horizontal/Vertical Accessibility for Super-Parallel Video Processing
Junichi MIYAKOSHI Yuichiro MURACHI Tomokazu ISHIHARA Hiroshi KAWAGUCHI Masahiko YOSHIMOTO

PAPER

Vol: E89-C No:11
Page(s): 1629-1636
For super-parallel video processing, we propose a power- and area-efficient SRAM core architecture with segmentation-free access, which means accessibility to arbitrary consecutive pixels, and horizontal/vertical access. To achieve these flexible accesses, a spirally-connected local-wordline select signal and a multi-selection scheme in wordlines are proposed, so that the extra X-decoders in the conventional multi-division SRAM can be eliminated. Consequently, the proposed SRAM reduces power and area by 57-60% and 60%, respectively, when it is applied to a 128-parallel architecture. The proposed 160-kbit SRAM with 16 read ports (a 2-read-port SRAM with an eight-parallel architecture) is implemented as a search window buffer for an H.264 motion estimation processor core, which dissipates 800 µW for QCIF 15-fps in a 130-nm technology.
Vision Chip Architecture for Detecting Line of Sight Including Saccade
Junichi AKITA Hiroaki TAKAGI Takeshi NAGASAKI Masashi TODA Toshio KAWASHIMA Akio KITAGAWA

PAPER

Vol: E89-C No:11
Page(s): 1605-1611
Rapid eye motion, the so-called saccade, is a very quick eye motion that always occurs regardless of our intention. Although the line of sight (LOS) with saccade tracking is expected to be used for a new type of computer-human interface, it is impossible to track it with a conventional video camera because of its speed, which is often up to 600 degrees per second. A vision chip is an intelligent image sensor that has the photoreceptor and the image processing circuitry on a single chip and can process the acquired image information while keeping its spatial parallelism. It also makes it possible to implement a very compact integrated vision system. In this paper, we describe a vision chip architecture capable of detecting the line of sight from an infrared eye image, with a processing speed that supports saccade tracking. The vision chip described here has a pixel-parallel processing architecture, with a node automaton at each pixel for image processing. The acquired image is first digitized by comparators into two flags indicating the Purkinje's image and the pupil. The digitized images are then shrunk, followed by several steps of expanding, by the node automata located at each pixel. The shrinking process is repeated until all the pixels disappear, and the last pixel to disappear indicates the center of the Purkinje's image and the pupil. This disappearing step is detected by the projection circuitry in the pixel circuit for fast operation, and the coordinates of the center of the Purkinje's image and the pupil are generated by simple encoders. We describe the whole architecture of this vision chip as well as the pixel architecture. We also describe the evaluation of the proposed algorithm by numerical simulation, the processing speed measured using an FPGA, and the improvement in resolution using a column-parallel architecture.
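
The shrink-until-empty step described above can be imitated in software on a binary image. This is a minimal sketch under stated assumptions: the image is a set of (row, column) pixels, one shrink step is a 4-neighbour erosion, and the noise-removing expansion steps are omitted. `shrink_once` and `center_by_shrinking` are hypothetical names, and on the chip every pixel would update at once where this code loops.

```python
def shrink_once(pixels):
    # One erosion step: a pixel survives only if all four neighbours are set.
    return {
        (r, c)
        for (r, c) in pixels
        if {(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)} <= pixels
    }


def center_by_shrinking(pixels):
    # Shrink until nothing is left; the last survivors mark the center.
    current = set(pixels)
    if not current:
        raise ValueError("empty image")
    last = current
    while current:
        last = current
        current = shrink_once(current)
    rows = [r for r, _ in last]
    cols = [c for _, c in last]
    # Average the survivors in case several pixels disappear together.
    return (sum(rows) / len(rows), sum(cols) / len(cols))


if __name__ == "__main__":
    blob = {(r, c) for r in range(2, 7) for c in range(3, 8)}  # 5 x 5 square
    print(center_by_shrinking(blob))
```

For the 5 x 5 square, two shrink steps leave a single pixel at (4, 5), and the third removes it.
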
Boundary-Active-Only Adaptive Power-Reduction Scheme for Region-Growing Video-Segmentation
Takashi MORIMOTO Hidekazu ADACHI Osamu KIRIYAMA Tetsushi KOIDE Hans Jurgen MATTAUSCH

LETTER-Image Processing and Video Processing

Vol: E89-D No:3
Page(s): 1299-1302
This letter presents a boundary-active-only (BAO) power reduction technique for cell-network-based region-growing video segmentation. The key approach is an adaptive situation-dependent power switching of each network cell, namely only cells at the boundary of currently grown regions are activated, and all the other cells are kept in low-power stand-by mode. The effectiveness of the proposed technique is experimentally confirmed with CMOS test-chips having small-scale cell networks of up to 4133 cells, where an average of only 1.7% of the cells remains active after application of the proposed approach. About 85% power reduction is thus achievable without sacrificing real-time processing.
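
The boundary-active-only idea can be mimicked in software by a region-growing loop that only ever visits the current frontier. The sketch below is an illustration, not the cell-network hardware: the grey-level similarity test, the 4-neighbourhood and the names `grow_region` and `active_counts` are assumptions, and the count of frontier cells per step stands in for the number of powered cells.

```python
def grow_region(image, seed, threshold):
    # Grow a region from `seed`, visiting only cells on the current boundary.
    rows, cols = len(image), len(image[0])
    region = {seed}
    boundary = {seed}      # the only "active" cells in this step
    active_counts = []     # how many cells were active in each step
    while boundary:
        active_counts.append(len(boundary))
        next_boundary = set()
        for r, c in boundary:
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and (nr, nc) not in region:
                    # Assumed similarity rule: neighbouring grey levels are close.
                    if abs(image[nr][nc] - image[r][c]) <= threshold:
                        region.add((nr, nc))
                        next_boundary.add((nr, nc))
        boundary = next_boundary
    return region, active_counts


if __name__ == "__main__":
    image = [
        [10, 10, 50],
        [10, 11, 50],
        [50, 50, 50],
    ]
    region, active = grow_region(image, (0, 0), threshold=2)
    print(sorted(region), active)
```

Cells already absorbed into the region and cells far from it are never touched, which is the property the letter exploits to keep most of the network in stand-by mode.
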
A Coarse-Grain Hierarchical Technique for 2-Dimensional FFT on Configurable Parallel Computers
Xizhen XU Sotirios G. ZIAVRAS

PAPER-Parallel/Distributed Algorithms

Vol: E89-D No:2
Page(s): 639-646
FPGAs (Field-Programmable Gate Arrays) have been widely used as coprocessors to boost the performance of data-intensive applications [1],[2]. However, there are several challenges to further boost FPGA performance: the communication overhead between the host workstation and the FPGAs can be substantial; large-scale applications cannot fit in a single FPGA because of its limited capacity; mapping an application algorithm to FPGAs still remains a daunting job in configurable system design. To circumvent these problems, we propose in this paper the FPGA-based Hierarchical-SIMD (H-SIMD) machine with its codesign of the Pyramidal Instruction Set Architecture (PISA). PISA comprises high-level instructions implemented as FPGA functions of coarse-grain SIMD (Single-Instruction, Multiple-Data) tasks to facilitate ease of program development, code portability across different H-SIMD implementations and high performance. We assume a multi-FPGA board where each FPGA is configured as a separate SIMD machine. Multiple FPGA chips can work in unison at a higher SIMD level, if needed, controlled by the host. Additionally, by using a memory switching scheme and the high-level PISA to partition applications into coarse-grain tasks, host-FPGA communication overheads can be hidden. We enlist the two-dimensional Fast Fourier Transform (2D FFT) to test the effectiveness of H-SIMD. The test results show sustained high performance for this problem. The H-SIMD machine even outperforms a Xeon processor for this problem.
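
A two-dimensional transform splits naturally into coarse-grain tasks because it can be computed as one-dimensional transforms over all rows followed by one-dimensional transforms over all columns. The sketch below shows that row-column decomposition with a plain O(n²) DFT in place of an FFT for brevity. It does not model H-SIMD or PISA, and the function names are hypothetical.

```python
import cmath


def dft(vec):
    # Direct 1-D discrete Fourier transform (a real system would use an FFT).
    n = len(vec)
    return [
        sum(vec[k] * cmath.exp(-2j * cmath.pi * out * k / n) for k in range(n))
        for out in range(n)
    ]


def dft2_row_column(matrix):
    # Stage 1: transform every row; each row is an independent coarse-grain task.
    rows_done = [dft(row) for row in matrix]
    # Transpose so that columns become rows.
    cols = [list(col) for col in zip(*rows_done)]
    # Stage 2: transform every column, again independently.
    cols_done = [dft(col) for col in cols]
    # Transpose back to the original orientation.
    return [list(row) for row in zip(*cols_done)]


if __name__ == "__main__":
    for row in dft2_row_column([[1, 2], [3, 4]]):
        print(row)
```

Within each stage no task depends on another, so the rows (and later the columns) can be handed to separate processing units. The transpose between the two stages is where data has to move.
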
Entropy Based Evaluation of Communication Predictability in Parallel Applications
Alex K. JONES Jiang ZHENG Ahmed AMER

PAPER-Performance Evaluation

Vol: E89-D No:2
Page(s): 469-478
The performance of parallel computing applications is highly dependent on the efficiency of the underlying communication operations. While often characterized as dynamic, these communication operations frequently exhibit spatial and temporal locality as well as regularity in structure. These characteristics can be exploited to improve communication performance if the correct prediction model is selected to a suitable communication topology. In this paper we describe an entropy-based methodology for quantifying and evaluating the success of different prediction models on actual workloads drawn from representative parallel benchmarks. We evaluate two different prediction criteria and combinations thereof: (1) partitioning messages by source node and (2) using a first-order context model. We also describe a threshold for prediction designed to largely avoid the overheads of incorrect predictions. Our results show that with simple prediction models, even on highly dynamic benchmark applications, predictability can be improved by several orders of magnitude. In fact, using simple prediction techniques, over 75% of the communication volume is accurately predictable.
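
How an entropy measure can score the two criteria above is sketched below on a toy message trace. The abstract does not give the paper's exact metric, so everything here is an assumption for illustration: Shannon entropy of the destination stream, conditional entropy given the source node, and conditional entropy given the previous destination as a first-order context. The function names and the trace are made up.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(items):
    # Shannon entropy in bits of the empirical distribution of `items`.
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def conditional_entropy(pairs):
    # H(symbol | context) for (context, symbol) pairs: the weighted average
    # of the entropy inside each context group.
    groups = defaultdict(list)
    for context, symbol in pairs:
        groups[context].append(symbol)
    total = sum(len(g) for g in groups.values())
    return sum(len(g) / total * entropy(g) for g in groups.values())


if __name__ == "__main__":
    # Toy trace of (source, destination) messages.
    trace = [(0, 1), (1, 2), (0, 1), (1, 2), (0, 1), (1, 2), (0, 1), (1, 2)]
    dests = [d for _, d in trace]
    print(entropy(dests))                               # no model
    print(conditional_entropy(trace))                   # partition by source
    print(conditional_entropy(zip(dests, dests[1:])))   # first-order context
```

Lower entropy means the next destination is easier to predict. In this trace both the partitioning by source and the first-order context remove all uncertainty.
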
A Resource-Shared VLIW Processor for Low-Power On-Chip Multiprocessing in the Nanometer Era
Kazutoshi KOBAYASHI Masao ARAMOTO Hidetoshi ONODERA

PAPER-Digital

Vol: E88-C No:4
Page(s): 552-558
We propose a low-power resource-shared VLIW processor (RSVP) for future leaky nanometer process technologies. It consists of several single-way independent processor units (IPUs) that share parallel processor resources. Each IPU works as a variable-way VLIW processor, sharing the parallel resources according to the priorities of given tasks. RSVP allocates shared parallel resources to the IPUs cycle by cycle. It can minimize the number of NOPs, which waste power. The performance per power (P3) of a 4-parallel 4-way RSVP, which corresponds to four 4-way VLIWs, is 3.7% better than that of a conventional 4-parallel 4-way VLIW multiprocessor in the current 90 nm process. We estimate that the RSVP achieves 36% less leakage power and 28% better P3 in the future 25 nm process. We have fabricated an RSVP test chip that contains two IPUs and a shared resource equivalent to two 2-way VLIWs in a 180 nm process. It is functional at a 100 MHz clock speed and its power is 130 mW.
An MAMS-PP4: Multi-Access Memory System Used to Improve the Processing Speed of Visual Media Applications in a Parallel Processing System
Hyung LEE Hyeon-Koo CHO Dae-Sang YOU Jong-Won PARK

PAPER-Concurrent Systems

Vol: E87-A No:11
Page(s): 2852-2858
To fulfill the computing demands of visual media processing, we have been investigating a parallel processing system that improves the processing speed of visual media applications from the point of view of the memory system within a single instruction multiple data (SIMD) computer. In this paper, we introduce MAMS-PP4, which resembles a pipelined SIMD architecture and consists of pq processing elements (PEs) as well as a multi-access memory system (MAMS). MAMS supports simultaneous access to pq data elements within a horizontal (1 × pq), a vertical (pq × 1) or a block (p × q) subarray with a constant interval at an arbitrary position in an M × N array of data elements, where the number of memory modules, m, is a prime number greater than pq. MAMS reduces the memory access time of an SIMD computer and also reduces the cost and complexity involved in controlling the large volume of data demanded by visual media applications. Each PE is designed as a two-state machine in order to utilize MAMS efficiently. MAMS-PP4 was fabricated as an ASIC using the TOSHIBA TC240C series library, and a test board was used to measure the performance of the ASIC. The test board consists of devices such as an MPC860 embedded-PCI board, two ASICs and an FPGA for the control units. Experiments were run on various computer systems, with morphological operations as the application, in order to compare the performance of MAMS-PP4. MAMS-PP4 shows a respectable and consistent processing speed.
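
Why a prime number of memory modules greater than pq helps can be checked with a small experiment. The abstract does not state how MAMS maps array elements to modules, so the skewing function `(i * q + j) % m` below is an assumption chosen for illustration, as are all function names. The sketch tests whether the pq elements of a row, column or block subarray land in pq different modules, which is the condition for fetching them in one access.

```python
def module_of(i, j, q, m):
    # Assumed skewing function (not taken from the abstract): element (i, j)
    # is stored in memory module (i * q + j) mod m.
    return (i * q + j) % m


def subarray(i, j, height, width, interval=1):
    # Cells of a height x width subarray based at (i, j) with a constant interval.
    return [
        (i + a * interval, j + b * interval)
        for a in range(height)
        for b in range(width)
    ]


def is_conflict_free(cells, q, m):
    # True when every cell maps to a different module.
    modules = [module_of(i, j, q, m) for i, j in cells]
    return len(set(modules)) == len(modules)


if __name__ == "__main__":
    p, q = 2, 2
    shapes = {"row": (1, p * q), "column": (p * q, 1), "block": (p, q)}
    for m in (5, 4):  # 5 is a prime greater than pq; 4 is not
        for name, (h, w) in shapes.items():
            print(m, name, is_conflict_free(subarray(0, 0, h, w), q, m))
```

With m = 5 all three shapes are conflict-free under this mapping, while with m = 4 the column access sends two elements to the same module. The check also passes for larger intervals as long as the interval is not a multiple of m.
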
